Goto

Collaborating Authors

 Translational Bioinformatics


China races to build record biobank to rival U.S. drugs research

The Japan Times

China races to build record biobank to rival U.S. drugs research Biobanks store masses of biomedical data such as clinical records, genome sequences and other long-term health metrics that research and drug development depend on. As a fledgling researcher in U.S., Zhang Li was struck by the efficiency of extracting human tissue in the morning and mining it for data the same afternoon. Such a streamlined process had been missing from his years of training as a bio data scientist in China. Inspired, he returned home to Beijing to join the Chinese Institute for Brain Research and launch a national database that will collect blood and DNA samples from 33,000 children to help identify patterns of brain disease and their risk factors. "Biomedical data is extremely valuable and is fundamental for us to find solutions to diseases and to delay aging," said Zhang, surrounded by robotic arms carefully organizing blood samples.


Biconvex Biclustering

Rosen, Sam, Chi, Eric C., Xu, Jason

arXiv.org Machine Learning

This article proposes a biconvex modification to convex biclustering in order to improve its performance in high-dimensional settings. In contrast to heuristics that discard a subset of noisy features a priori, our method jointly learns and accordingly weighs informative features while discovering biclusters. Moreover, the method is adaptive to the data, and is accompanied by an efficient algorithm based on proximal alternating minimization, complete with detailed guidance on hyperparameter tuning and efficient solutions to optimization subproblems. These contributions are theoretically grounded; we establish finite-sample bounds on the objective function under sub-Gaussian errors, and generalize these guarantees to cases where input affinities need not be uniform. Extensive simulation results reveal our method consistently recovers underlying biclusters while weighing and selecting features appropriately, outperforming peer methods. An application to a gene microarray dataset of lymphoma samples recovers biclusters matching an underlying classification, while giving additional interpretation to the mRNA samples via the column groupings and fitted weights.



Appendix ProteinShake: Building datasets and benchmarks for deep learning on protein structures

Neural Information Processing Systems

Table 3: Comparison of models trained with different representations of protein structure across various tasks, on a random data split . The optimal choice of representation depends on the task. Shown are mean and standard deviation across four runs with different seeds. Table 4: Comparison of models trained with different representations of protein structure across various tasks, on a sequence data split . Table 5: Comparison of models trained with different representations of protein structure across various tasks, on a structure data split .




Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA

Neural Information Processing Systems

Foundation models have made significant strides in understanding the genomic language of DNA sequences. However, previous models typically adopt the tok-enization methods designed for natural language, which are unsuitable for DNA sequences due to their unique characteristics. In addition, the optimal approach to tokenize DNA remains largely under-explored, and may not be intuitively understood by humans even if discovered. To address these challenges, we introduce MxDNA, a novel framework where the model autonomously learns an effective DNA tokenization strategy through gradient decent.


Efficient and Effective Optimal Transport-Based Biclustering: Supplementary Material

Neural Information Processing Systems

Z that represents some transfer of mass between elements of w and v . The proof is the same for W . Proposition 2. Suppose that the target row and column representative distributions are the same, The the Kantorovich OT problem and whose rank is at most min(rank(Z), rank( W)) . Proof of proposition 2. From linear algebra, we have that Proof of proposition 3. We suppose that The optimal transport problem can be formulated and solved as the Earth Mover's Distance (EMD) We report the biclustering performance on the synthetic datasets in table 2. At least one of our models finds the perfect partition in all cases. The gene-expression matrices used are the Cumida Breast Cancer and Leukemia datasets. Their characteristics are shown in Table 3. Table 3: Characteristics of the gene expression datasets.